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Abstract: Kernel estimation techniques, such as mean shift, suffer from one major drawback: the 
kernel bandwidth selection. The bandwidth can be fixed for all the data set or can vary at each points. 
Automatic bandwidth selection becomes a real challenge in case of multidimensional heterogeneous 
features. This paper presents a solution to this problem. It is an extension of [4| which was based 
on the fundamental property of normal distributions regarding the bias of the normalized density 
gradient. The selection is done iteratively for each type of features, by looking for the stability of 
local bandwidth estimates across a predefined range of bandwidths. A pseudo balloon mean shift 
filtering and partitioning are introduced. The validity of the method is demonstrated in the context of 
color image segmentation based on a 5-dimensional space. 
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Estimation a noyau adaptatif dans des espaces 
multidimensionnels heterogenes 



Resume : Les methodes d' estimation a noyau, telles que le mean shift, ont un inconvenient majeur : 
le choix de la taille du noyau. La taille peut etre fixe pour l'ensemble des donnees ou varier en chaque 
point. La selection de cette taille devient vraiment difficile dans le cas de donnees multidimension- 
nelles et heterogenes. Ce rapport presente une solution a ce probleme. II s'agit d'une extension de 
1'algorithme presente dans La taille est choisie iterativement pour chaque type de donnees, en 
cherchant dans un ensemble de tailles predefinies celle qui donne localement les resultats les plus 
stables. La selection iterative necessite Fintroduction d'un algorithme de segmentation mean shift 
base sur Testimateur dit balloon. La methode est validee dans le contexte de la segmentation d'image 
couleur. 

Mots-cles : estimateur a noyau, filtrage mean shift, taille de noyau 
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1 Introduction 

Clustering is an important task in a wide range of applications of computer vision. Many methods 
exist ifTTl . Most of them rely upon some a priori. For example, for methods such as EM, the number 
of clusters must be known beforehand. It can be estimated by optimizing a global criterion. Other 
methods assume known the shape, often elliptical, of the clusters, which is often not sufficient to 
handle the complexity of real images. A method that does not rely on these two priors is the "mean 
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shift" search for the modes of a kernel density estimation. The non-parametric aspect of the approach 
makes it very versatile to analyze arbitrary feature spaces. Hierarchical clustering methods are also 
non-parametric. However, they are computationally expensive and defining the stopping criterion 
is not simple. These reasons explain why the mean shift clustering became recently so popular in 
computer vision applications. 

Mean shift was first introduced by Fukunaga (9) and latter by Cheng Q- It has then been widely 
studied, in particular by Comaniciu |7]|6]|5]. Mean shift is an iterative gradient ascent method used 
to locate the density modes of a cloud of points, i.e. the local maxima of its density. The estimation 
of the density is done through a kernel density estimation. The difficulty is to define the size of the 
kernel, i.e. the bandwidth matrix. The value of the bandwidth matrix highly influences the results of 
the mean shift clustering. 

There are two types of bandwidth matrices. The first ones are fixed for the all data set. At the 
opposite, the variable bandwidth matrices vary along the set and capture the local characteristics of 
the data. Of course, the second type is more appropriate for real scenes. In fact, a fixed bandwidth 
affects the estimation performance by undersmoothing the tails of the density and oversmoothing the 
peaks. A variable bandwidth mean shift procedure has been introduced in 10. It is based on the 
sample point density estimator [ 10]. The estimation bias of this estimator decreases in comparison to 
the fixed-bandwidth estimators, while the covariance remains the same. The choice of a good value 
for the bandwidth matrix is really essential for the variable bandwidth mean shift. Indeed, when 
the bandwidth is not selected properly, the performance is often worse than with a fixed bandwidth. 
Another variable bandwidth estimator is the balloon estimator. It suffers of several drawbacks and 
has therefore never been used in a mean shift algorithm. However, it has been shown in l23l that 
this estimator gives better result than the fixed bandwidth and the sample point estimators when the 
dimensionality of the data is higher than three. Hence, in section [3*. 3. 21 we will propose a new mean 
shift clustering algorithm based on the balloon estimator. 

The bandwidth selection can be statistical analysis-based or task-oriented. Statistical analysis- 
based methods compute the best bandwidth by balancing the bias against the variance of the density 
estimate. Task-oriented methods rely on the stability of the feature space partitioning. For example, a 
semi parametric bandwidth selection algorithm, well adapted for variable bandwidth mean shift, has 
been proposed by Comaniciu in J4j[7]]. It works as follows. Fixed-bandwidth mean shift parti tionings 
are run on the data for several predefined bandwidth values. Each cluster obtained is described by 
a normal law. Then, for each point, the clusters to which it belongs across the range of predefined 
bandwidths are compared. The final selected bandwidth for this point corresponds to the one, within 
the predefined range, that gave the most stable among these clusters. The results obtained for color 
segmentation were promising. However, this method has some limits. In particular, in case of a mul- 
tidimensional data points composed of independent features, the bandwidth for each feature subspace 
should be chosen independently. Indeed, the most stable cluster is not always the same for all the fea- 
ture subspaces. A solution could be to define a set of bandwidths for each domain and to partition the 
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data using all the possible bandwidth matrices resulting from the combination of the different sets. 
However as the dimensions become high and/or if the sets of predefined bandwidths become large, 
the algorithm can become very computationally expensive. 

In this paper we address the problem of data-driven bandwidth selection for multidimensional 
data composed of different independent features (a data point is a concatenation of different, possibly 
multidimensional features, thus living in a product of different feature spaces). As no statistically 
founded method exists for variable bandwidth and for high dimension, we concentrate on a task- 
oriented method, i.e. a method that relies on the stability of the feature space partitionings. Band- 
widths are selected by iteratively applying the stability criteria of [4!] for each different feature space 
or domain. We also introduce a new pseudo balloon mean shift which is better adapted for high 
dimensional feature spaces than the variable bandwidth mean shift of (7). 

We first recall some theory on kernel density estimation (section O and mean shift filtering (sec- 
tion[3]l, and introduce the pseudo balloon mean shift filtering and partitioning (subsection l3.3.U . In 
section|U we present our algorithm for bandwidth selection algorithm in case of multivariate data and 
finally we show results of our algorithm for color clustering and color image segmentation (section 



2 Kernel density estimation 

For the clarity of the paper, we start by reminding several results on fixed and variable bandwidth 
kernel density estimation. 

2.1 Fixed bandwidth estimator 

Given {x (!) }i=i.. n , n points in the <i-dimensional space R d , the non-parametric kernel density estima- 
tion at each point x is given by: 



where Kn is a kernel, and the bandwidth matrix, H, controls the size of the kernel. The shape of the 
kernel is constrained to be spherically symmetric. Equation[T]can also be written as: 



The theory of kernel density estimation indicates that the kernel K must be a bounded function with 
compact support satisfying: 



7(x) = -J>h(x-x«) 




(i) 





(3) 
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where c k is a constant and I is the identity matrix. 

In many cases fixed bandwidth kernel estimators are not a good choice to represent the data. In- 
deed a variable bandwidth is more appropriate to capture the local characteristics of the data. Two 
main variable bandwidth estimators exist. The first one allows the definition of bandwidths at the dif- 
ferent data points and is referred to as the sample point estimator. The second one lets the bandwidth 
vary with the estimation points and is often referred to as the balloon estimator or nearest neighbor 
estimator. 

2.2 Sample point estimator 

The sample point estimator was first introduced by Breiman et al. Ifl2l . It is a mixture of similar 
kernels centered on data points, possibly with different bandwidths. It is defined as: 

1 ™ 

/(X) = -£*H(X«)(X-XW) 

n * — ' ^ ' 

';' (4) 
Ef5p^ A ' (H<xl " ,r ' /2(x - x<,,)) ' 

In ||231 the advantages and drawbacks of this estimator have been studied. The major advantages 
are that it is a density and that a particular choice of H(a; w ) can considerably reduce the bias ifTOl . 
However finding this value for multivariate data is a hard problem not yet solved. A disadvantage is 
that the estimate at a point may be influenced by observations very far away and not just by points 
nearby. In |j23l simulations have shown that this estimator has a very good behavior for small-to- 
moderate sample sizes. It deteriorates in performance compared to fixed bandwidth estimates as the 
sample size grows. 

2.3 Balloon estimator 

The balloon estimator was first introduced by Loftsgaarden and Quensberry fl4l . It is defined as: 

/(x) = il>H( X )(x-X«) 

t (5) 

4§P(xW^ (H(Xrl/2(X - x(i))) - 

This estimator allows a straightforward asymptotic analysis since it uses standard pointwise results 
ifTBl . On the other hand, when applied globally, the estimate typically does not integrate to 1 and thus 
is usually not itself a density, even when K is. In ll23ll the authors have investigated the improvement 
that this estimator allows over fixed bandwidth kernel estimates. For data of fewer than 3 dimensions, 
the improvement seems to be very modest. However the balloon estimator becomes very efficient as 
soon as the number of dimensions becomes higher than 3. 
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2.4 Quality of an estimator 

The quality of an estimator depends on the closeness of / to the target density /. A common measure 
of this closeness is the mean squared error (MSE), equal to the sum of the variance and the squared 
bias: 

MSE(x)=£[(/(x)-/(x)) 2 ] 

= var(/(x))+ [Bias(/(x))] 2 . 

A good estimator has a small bias and a small variance. We detail in this subsection the computation 
of the bias and the variance for the fixed bandwidth estimator. For that purpose, we denote as V/ the 
gradient of function / and as H(f) the Hessian matrix of second partial derivatives. The second-order 
Taylor expansion of /(•) around x ll24l p. 94] is: 

/(x + Sx) = /(x) + <5x T V/(x) + i<Jx T W(/(x))«5x + o(<5x T <5x) . (7) 
Applying to the fixed kernel estimator, it leads to the expectation: 
£(/W)= / ^y I75 /v(H- 1 / 2 (u-x))/(u)du 

= [ J A'(s)ds/(x) + J A(s)s T ds H 1 / 2T V/(x)+ (8) 
J iA(s)(H 1 / 2 s) T H(/(x))(H 1 / 2 s)ds + J A(s)o(s T Hs)ds] . 

Using the kernel properties (equation |6j, the fact that the trace of a scalar is just the scalar, and the 
identity tr(AB) = tr(BA), the bias of the fixed kernel estimator becomes: 



Bias(/(x)) = £(/(x)) - /(x) 

= c fe tr[H 1 / 2T H(/(x))H 1 / 2 ] + J A(s)o(s T Hs)ds 

The variance is 

1 ™ 

var(/(x))=var[-VA H (x-x«)] 

i=l 

= ^172 ( / (^(«)) 2 ds/(x) + J A(s)o(s T Hs)ds) 



(9) 



(10) 



Several other measures exist, mainly the mean integrated squared error (MISE) or the asymptotic 
mean integrated squared error (AMISE). A detailed derivation of these measures can be found in lfT9l 
and l24l. As discussed inl4.ll these measures can be used to select the best value for H. 



3 Mean shift partitioning 

An appealing technique for clustering is the mean shift algorithm, which does not require to fix the 
(maximum) number of clusters. In this section we first remind the definition of kernel profiles and 
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the principle of mean shift filtering and partitioning. As we are interested in variable bandwidth 
estimation, we give the result of J7) in which the mean shift for the sample point estimator was 
developed. We then introduce a novel pseudo balloon mean shift based on the balloon estimator. 

3.1 Kernel profile 

The profile of a kernel K is the function k : [0, oo) — > R such that K(x) = c fe fc(||a;|| 2 ), where c k is 
a positive normalization constant which makes A"(x) integrate to one. Using this profile the fixed 
bandwidth kernel density estimator can be rewritten as: 

1 - 

/(x) = H^^ X(H " 1/2(x_xW)) 

l t (ID 

= ^E fc (ii H - 1/2 ( x - x(l) )ii 2 ) 

1 1 1=1 

Two main kernels are used for mean shift filtering. Using a fixed bandwidth estimator, it can be shown 
|[T9l p,1391 ll2"4l p. 104] that the AMISE measure is minimized by the Epanechnikov kernel having the 
profile: 

!1 - x < a; < 1 
(12) 
x > 1 . 

The drawback of the Epanechnikov kernel is that it is not differentiable at the boundary of its support 
(for x = 1). The second kernel is the multivariate normal one defined by the profile: 

k(x)=exp(~x) (13) 

which leads to the interesting property: 

g{x) = -k'(x) = ~k(x) . (14) 



The normalization for this profile is c k = (2ty) 



-d/2 



3.2 Fixed bandwidth mean shift filtering and partitioning 

Mean shift is an iterative gradient ascent method used to locate the density modes of a cloud of points, 
i.e. the local maxima of its density. The mean shift filtering is well described in Q. Here the theory 
is briefly reminded. 

The density gradient of the fixed kernel estimator (equation[T|i is given by: 



V/(x) = H /(x) m(x) (15) 



where m is the "mean shift" vector, 



, , Eti*W g (||H-i/2(x-xW)P) 

m W = niTT-l/2/ <i)\\\2\ ~ X ' (16) 
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Using exactly this displacement vector at each step guaranties convergence to the local maximum 
of the density Q. A mode seeking algorithm (algorithm [TJ, or mean shift filtering can be derived 
by iteratively computing the mean shift vector. Each computation of this vector leads to a trajectory 
point y (j) . The first trajectory point y (1) is the estimation point x itself while the last point y (tm) is 
the associated mode z . The final partition of the feature space is obtained by grouping together all 
the data points that converged to the same mode (algorithmic. 

Algorithm 1 Mean shift filtering 

Let {x^}i=i,..., n be n input points in the d-dimensional space and {z' l '}j=i,..., n their associated 
modes. For i = 1 . . . n 

1. Initialize j = 1, y (1) = x w . 

2. Repeat 

• = y^' + m(y ( "'' ) ) according to equation[T6l 

• 3 = 3 + 1- 
Until y^- 1 ' =y (j) . 

3. Assign z (i) = y (j) . 



Algorithm 2 Mean shift partitioning 



Let {x^}t=i,...,n be n input points in the d-dimensional space and {zW}i=i,... >n their associated 
modes. 

1 . Run the mean shift filtering algorithm. 

2. Group together all z w which are closer than H, i.e two modes z w and z (j) are grouped together 
if: 

||zW -z0')|| < 1 1 M || . 

3. Group together all x w whose associated mode belongs to the same group. 

Mean shift with normal kernel usually needs more iterations to converge, but yields results that 
are almost always better than the ones obtained with the Epanechnikov kernel. In the sequel we will 
only consider the multivariate normal kernel. With a rf-variate Gaussian kernel, equation[T6lbecomes 

_ ElUxO ex P H^(x,xW H)) _ 
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where 

£> 2 (x,x (i \H) = (x-x w ) T H _1 (x-xW) (18) 
is the squared Mahalanobis distance from x to x w . 

3.3 Variable bandwidth mean shift 

In the sequel we detail the mean shift using the two variable bandwidth estimators. The first version, 
called "variable bandwidth mean shift", is based on the sample point estimator and was introduced 
in 0. The second one is novel, since the balloon estimator has never been used in a mean shift 
algorithm. We will refer to the algorithm as "pseudo balloon mean shift". 

3.3.1 Variable bandwidth mean shift using sample point estimator 

The mean shift filtering using the sample point estimator was first introduced in Q. Using this 
estimator, equationfTTIbecomes: 

EHi |H(xW)|-V 2 xW ex P HJ 2 (x,xW H(x«))) _ 
Er=i|H(x«)|" 1 /2exp(-I^ (X;xW)H (xW))) X - UVj 

As with fixed bandwidth kernel estimator, a mean shift filtering algorithm can be derived based on this 
mean shift vector. The proof of convergence of mean shift filtering using the sample point estimator 
can be found in JTJ. 

3.3.2 Pseudo balloon mean shift 

As mentioned earlier, the balloon estimator /(x) is not always a density (does not always integrate 
to one), and leads to discontinuity problems. Its derivative V/(x) contains terms that depend on 
(x — x (l) ) 2 and of H'(x). Thus there is no closed-form expression for the mean shift vector. To be 
able to develop a mean shift filtering algorithm based on the balloon estimator, several assumptions 
must be made. In the context of mean shift algorithms, the bandwidth function H is only defined 
discretely at estimation points. To turn the estimator into a density and to give a closed form to the 
derivatives, we assume that Vi = 1 . . . n, H'(x (l) ) = 0. Using the kernel profile k, equation[5]evaluated 
at data points becomes, for i = 1 . . . n: 

/V l) ) = ^E | H J))|i/ a fc(l|H(xW) " 1/2(x(3) - xW)l|2) • (20) 
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Since we consider H'(x w ) = 0, its derivative is: 

V/(x«) = V/(xW) 

n 

= ^(xTO^^ 11 ^^"^ - x(l) ) fc (ll H ( xW )" 1/2 ( xb) - xW )H 2 ) 
= n | H (xW)|i/2 H ( xW ) -1 E *(I|H(xW)-^(xW - xW)f )(xW - xM) 

= i[^H(x«)-^ H(xW)( x« _x«)] rfi 1 r H(x( )( ' * - x(t) 

(21) 

A mean shift filtering algorithm can be derived using the last term of previous equation as the mean 
shift vector: 

mix) = ; j-t- X . (22) 

1 ' E;=1^H(x)(x-xO)) 

Previous equation is only valid at the data points. Therefore, if H varies for each trajectory point, the 
mean shift filtering algorithm is not valid and its convergence can not be proved. The solution that 
we propose is to defined a pseudo balloon mean shift where the bandwidth varies for each estimation 
point but is fixed for all trajectory points. This means that the data points influencing the computation 
of a trajectory point are taken with the same bandwidth along all the gradient ascent trajectory. The 
advantage is that the estimate at a point will not be influenced by observations too far away. We then 
take the bandwidth H(x) constant for all trajectory points y (j) corresponding to the estimation point 
x (belonging to the data points). In other words, for a given starting point, the procedure amounts 
to a fixed bandwidth mean shift, with bandwidth depending on the starting point. We call this new 
mean shift pseudo balloon mean shift. The convergence of the pseudo balloon mean shift filtering if 
H(x) T = H(x) is demonstrated in appendix. The pseudo balloon mean shift partitioning algorithm 
is described in Algorithm [3] We use the minimum of the two bandwidths in step 2 to avoid the 
aggregation of two very distant modes. 



4 Bandwidth selection for mixed feature spaces 

Results of mean shift filtering or partitioning always highly depend on the kernel bandwidth H which 
has to be chosen carefully. Various methods for bandwidth selection exist in literature. In particular, 
several statistical criteria, which generally aim at balancing the bias against the variance of an esti- 
mator, have been introduced. They are called statistical-analysis based methods, and can be applied 
to any method based on kernel estimation. Other techniques, only dedicated to clustering, define a 
criteria based on the stability of the clusters. They are called stability based methods. In subsection 
14.11 we present a review of these two types of techniques. Many of these methods have proven to be 



RR n° 6286 



12 



Bugeau & Perez 



Algorithm 3 Pseudo balloon mean shift partitioning algorithm 

Let {x^}i=i,... lT , be n input points in the e?-dimensional space and {z^'}i=i,..., n their associated 
modes. 

1. For i = 1, . . . , n, run the mean shift filtering algorithm from x = x^, with H(x) = H(xW), 
using equation d22l l, 

2. Group together two modes z w and z ' if: 

|| z « < miri(||H(x«)||, ||H(x ( ^)||) . 

3. Group together all x (l) whose associated modes belong to the same group. 



very efficient. Nevertheless, none of them is really adapted to data in high dimensional heterogeneous 
spaces. 

Therefore, in this section we propose an algorithm dedicated to the mean shift partitioning in 
high dimensional heterogeneous space. We assume that the d-dimensional input space can be decom- 
posed as the Cartesian product of P independent spaces associated to different types of information 
{e.g. position, color), also called feature spaces or domains, with dimension d p ,p = 1 . . . P (where 

Ep =1 d P = d). 

4.1 Existing methods for bandwidth selection 

This first subsection present a short review of existing bandwidth selection methods of both types. 
4.1.1 Statistical-analysis based methods 

Statistical methods aim at improving the quality of the kernel estimator. We remind that a good 
estimator is an estimator that has a small bias and a small variance. The quality is usually evaluated 
by measuring the distance (MSE, MISE, AMISE...) between the estimate / and the target density /. 
These measures are of little practical use since they depend on the unknown density function / both 
in the variance and the squared bias term (subsection 12.41 ). However, the definition of the bias and 
the variance leads to the following property: the absolute value of the bias increases and the variance 
decreases as H increases. Therefore to minimize the mean squared error (or any other measure), we 
are confronted with a bias-variance trade-off. 

Several good solutions can be found in literature to find the best value for H. In particular, we can 
mention the "rule of thumb" method EH . the "plug-in" rules lfT7ll20l and the cross validation methods 
|[T7lll22l p. 46]. However, all these techniques have some drawbacks. The "rule of thumb" assumes 
that the density is Gaussian. A practical algorithm based on the plug-in rule for one dimensional data 
can be found in Q. An algorithm for the case of multivariate data is presented in E4l p. 108] but it 
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is difficult to implement. Finally, cross validation methods becomes very computationally expensive 
for a large set of data. 

For variable bandwidth, the most often used method takes the bandwidth proportional to the 
inverse of the square root of a first-order approximation of the local density. This is the Abramson's 
rule 0"). Two parameters must then be tuned in advance: a proportionality constant and an initial 
fixed bandwidth. The proportionality constant influences a lot the result. Also, for multidimensional 
multimodal data, the initial fixed bandwidth is hard to determine. The application of this technique to 
the variable bandwidth mean shift, i.e. to the sample point estimator, has been proposed and discussed 
in 0. It leads to good results on toy examples. Evaluation of partitioning on real data is subjective, 
and it is hard to assert the superiority of such statistical methods. Furthermore, their application to 
high dimensional multimodal data is still an open problem. 

4.1.2 A stability based method 

Methods for bandwidth selection specially dedicated to clustering have also been studied. They are 
based on cluster validation. Many criteria determining the validity of a cluster exist fl6l . For ex- 
ample, some methods evaluate the isolation and connectivity of the clusters. Another criterion is the 
stability. It is based on human visual experience: real clusters should be perceivable over a wide range 
of scales. This criterion has been used in the scale space theory in lfl3l or in (8) where the stability 
of clusters depends on the size of the clusters. An application of a stability criterion to bandwidth 
selection for mean shift partitioning was introduced in J4). The idea is that a partition should not 
change when a small variation is applied to the bandwidth. 

As no statistical methods are currently well adapted to the variable bandwidth estimation in high 
dimensional heterogeneous data, we decided to use the stability to validate the partitions and to find 
the best bandwidth at each point. The basic principle of the method that we propose is based on the 
one in [4|. The goal is to find the best bandwidth at each point within a set of predefined bandwidths. 
Given a set of B predefined matrices {H (b) ,6 = 1,..., B}, the best bandwidth, denoted as T(x w ), in 
this predefined set, at each point x (l) indexed by i, is the one that gives the most stable clusters. The 
method is composed of two main steps. 

The first step is called bandwidth evaluation at the partition level. The mean shift partitioning is 
run for each of the predefined matrices. For each scale b of this range, the data is divided into a certain 
number of clusters. For simplicity we introduce the function c which, for each scale b, associates a 
data point indexed by i to its corresponding cluster. If the i-th data point belongs to the u-th cluster at 
scale b, then c(i, b) = u. Each cluster is then represented parametrically. Indeed the stability criteria 
chosen in lUJ asserts that if the clusters can be represented by normal laws, then the best cluster is the 
one for which the normal law is the most stable. If few points are added to the partition or if some 
are left apart, the distribution of the cluster should not change. The assumption of normality seems 
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reasonable in a small neighborhood of a point, this neighborhood being found by the partitioning. 
Each cluster indexed by u at scale b is then represented parametrically by a normal law A/"(/il il> , £« ) . 
Let ci b) be the set of indices of points belonging to a cluster u at scale b, ci 6) = {i/c(i, b) = u}. The 
mean fi^' corresponding to cluster it at scale b is defined as: 

& = i E * W ■ (23) 



1 " 1 iGCi 



and the empirical covariance as: 



^ = iE (* W - M?>)(x W - Ml b) ) T • (24) 

These expectation and covariance estimates are easily corrupted by non-Gaussian tails which might 
occur. In H, other formula have been established to solve this problem. However, the proposed 
covariance does not seem reliable since it can be negative. Therefore, in the sequel, we will consider 
that all the means and covariances are computed with the traditional definitions. This choice gave 
satisfactory results for all the tests we have run but we believe that further work should concentrate 
on finding a better way to compute the covariance based on existing techniques for robust covariance 
matrix estimation lfl8l l25ll . 

After building all the normal laws, each point is associated at each scale to the law of the cluster it 
belongs to. The point indexed by i is associated for scale b to the distribution p' 6) = Miji^l b y S^' b ^). 

The second step evaluates for each point the clusters to which this point was associated and 
finds the most stable one. This second step is called bandwidth evaluation at the data level. It 
mainly consists in the comparison of the clusters, through the comparison of the normal laws. Several 
divergence measures between multiple probability distributions have been studied in literature. In 0, 
the authors use the Jensen-Shannon divergence to compare the distributions. Given r d-variate normal 
distributions pj , j = 1, . .. ,r, defined by their mean /ij and covariance Ej, the Jensen-Shannon 
divergence is defined as: 

JS(P1 ■ ..Pr) = ^log E f 1 + \ - \ i» T (E ^r\H \ X» ■ (25) 

ylIf=ll S jl 3=1 3=1 3=1 3=1 

The comparison is done between three neighboring scales (r = 3) and for each domain independently, 
the distributions being p^ 1 , p\ and p\ +1 . The best scale b* = argmin h JS(p[ b , pf \ pf ) is the 
one for which the Jensen-Shannon divergence, 



js( P r ± ',pr,pr ± ') = ^o, 

, 6+1 , 6+1 6+1 6+1 

2 V^c(i,&) 3 ^c{i,b)) \ c{i,b)> y^c{i.b) 3 Pclifi)) 

3=6-1 3=6-1 i=6-l 3=6-1 



(26) 
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is minimized. The final best bandwidth for the point x w is the predefined matrix that gave the 
most stable cluster: T(x ( - l) ) = H 1 - 6 This is in the contrast with the original method in [4], where 
Y(x (l) ) = S^^.j. The latter choice does not guarantee that the estimated bandwidth lies between 
the extremal bandwidths of the selected range (H' 1 ) and H' b ) when they are sorted). 

These two steps are described in Figure Q] The final partition of the data is obtained by rerunning 
a variable or a pseudo balloon mean shift partitioning using the selected matrices. Unfortunately, this 
algorithm is limited to data composed of one feature space. Therefore, in next subsection we propose 
an iterative algorithm, based on the previous method, that handles the heterogeneity of the data. 



Bandwidth evaluation at the partition level: Evaluate the bandwidth at the data level: 

- Mean shift partitioning at each scale - For each point compute the threefold JS divergences 

- Compute the distribution for each cluster - The bandwidth corresponds to the lowest divergence 



b 




Figure 1 : Scheme of an iteration of our algorithm. 

4.2 Handling heterogeneity: iterative selection 

For high dimensional heterogeneous data the set of predefined bandwidths can become large. Indeed, 
if the different domains of multidimensional data are independent, the bandwidth for each feature 
space should be chosen independently. The most stable cluster is not obtained for the same scale 
for all the domains. A solution could be to define a set of bandwidths for each feature space and 
to partition the data using all the possible bandwidth matrices resulting from the combination of the 
different sets. However as the dimensions become high and/or the sets of predefined bandwidths 
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become larges, the algorithm can become computationally very expensive: if we have a range of B p 
analysis bandwidths for each feature space p, the mean shift partitioning has to be run nf=i B P times 
to take into account every possibility. Thus the algorithm is not adapted for spaces composed of 
several independent components. The solution is then to find the best bandwidth iteratively for each 
feature space, so that the mean shift partitioning is run only $^p=i B P times. 

Suppose that we are trying to select the best bandwidth at each data point for the first feature 
space. We fix temporary matrices H p , p = 2, . . . , P for each of the other feature spaces. These 
matrices are constant for all scales and equal to the mean over all the B p possible matrices: 

1 Bp 

K p = —Y J ^\p>l . (27) 

p 6=1 

The bandwidth selection algorithm previously defined (Figure [TJ is run for the matrix range 

{fiW =diag[Hf ) ,H 2) ...,H P ],6=l,...,B 1 } 

and finds the best bandwidth Yi(x^) for each point x (,) . The same procedure is then run for every 
other feature space. The difference is that for the feature spaces that have already been studied the 
bandwidth matrix is not constant anymore: 

HW(x(' i )) = diag[T 1 (xW),...,T,_ 1 (xW),HW,H p+1 ...H P ] . (28) 

A variable bandwidth mean shift must then be used. As the dimension of the data is higher than 3, 
we prefer the balloon based mean shift partitioning, but the sample point estimator could be used as 
well within the same procedure. 

4.3 Bandwidth selection final algorithm 

The proposed iterative algorithm solves the bandwidth selection for high dimensional heterogeneous 
data problem. Each feature space is processed successively in two stages. The first stage consists in 
partitioning the data for each scale and building a parametric representation of each cluster. The sec- 
ond stage selects for each data point the most stable cluster which finally leads to the best bandwidth. 
The final algorithm is presented in algorithm^ 

5 Experimental results 

This section presents some results of our method on color image segmentation. The final partition 
of the data is obtained by applying a last time the pseudo balloon mean shift partitioning with the 
selected variable bandwidths. The data set is the set of all pixels of the image. To each data point 
is associated a 5-dimensional feature vector: two dimensions for the position and three for the color. 
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Algorithm 4 Iterative estimation of bandwidths 



Given a set of B p predefined bandwidths {H p 6) , b = 1 . . . B} for each feature space p. The bandwidth 
selection is as follows. 
Forp= 1,.. .,P 

• Evaluate the bandwidth at the partition level: For all b = 1, . . . , B 

1 . For all p' = p + 1, . . . , P, compute H p / : 

p 6=1 

2. Define, for i = 1, . . . , n, 

{fiW( x (0) = diag[Ti(x«), . . . , T„_i(x«),H?\ H P+ i . . .Hp], 6 = 1, . . . , B p }. 

3. Partition the data using the balloon mean shift partitioning. The result is clusters 
denoted as ci ft) , u = 1 . . . . Introduce the function c that associates a point indexed by 
i to its cluster u: c(i, b) = u. 

4. Compute the parametric representation M{iiu of each partition using: 



u ib) - — — V x (l) 
1 u 1 ieci 1 



(30) 



and 



S « 5 = i E (x W -M? ) )(x«-M? ) ) T = diag[sS,... ) Sl%] . (31) 



5. Associate to each point x w the mean /r^ 6) p and covariance b) of the cluster it 
belongs to and the corresponding normal distribution pr^. 

Evaluate the bandwidth at the data level: For each point x' l) 

1. Select the scale b* giving the most stable normal distribution by solving: 

6* = »igrmn r=2 _ B _ 1 JS(p^ 1 \p%,p^ 1) ) (32) 

where JS is the Jensen-Shanon divergence defined by equation 

2. The best bandwidth T p (x (i) ) is Hf 1 . 



We here consider the independency of all the dimensions, i.e. 5 features spaces each composed of 
one dimension. The order in which the dimensions are processed by our algorithm is the following: x 
coordinate, y coordinate, red, green and blue channels. In the final subsection we discuss the influence 
of the order in which the feature spaces are processed. For all the results presented in this section the 
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same predefined bandwidths are used. For all the feature spaces, we used 9 predefined bandwidths in 
the range of 10-30. Of course this range is large and it would be better to adapt it to each image, for 
example by using some information on the noise as in J2), but it is sufficient to validate our algorithm. 
The color of a pixel in the segmented images corresponds to the color of its associated mode. 

The two novel features of our algorithm are successively validated with comparisons to other 
methods. First we validate the iterative bandwidth selection in independent feature spaces by com- 
paring our algorithm with the same method in which the bandwidths evolve jointly in the different 
feature spaces. We then compare the variable mean shift based on the sample point estimator and the 
pseudo balloon mean shift. Several results for each of these two points are shown. 

5.1 Validation of the iterative bandwidth selection 

We start by validating the iterative selection on several examples. The comparison is done between 
our algorithm with five feature spaces and our algorithm with a single five-dimensional feature space. 
In the last case, all dimensions are consider dependent and the same scale is finally selected for all 
dimensions. 

The first results are presented on an outdoor image. Figure [2] shows the final partitioning for the 
non iterative selection Figure EJb)) and the iterative selection Figure 12(c)). With the non-iterative 
algorithm, 21 clusters were found, while the iterative method gave 31 clusters. At the end of the 
segmentation the sky and the mountains are merged together with the non iterative algorithm. Differ- 
ences are also visible on the mountains, in which the iterative method gave more clusters. In figure 




a) b) c) 

Figure 2: Validation of the iterative selection on the outdoor image, a) Original image; b) Non 
iterative bandwidth selection; c) Iterative bandwidth selection. 



|3] the evolution through scales of the mean shift partitioning that corresponds to the first step of the 
non iterative algorithm ("evaluate the bandwidth at the partition level") is shown. The evolution for 
our iterative algorithm is presented in figure [4] This time the evolution is shown through scales and 
through feature spaces. Because bandwidths evolve jointly in the different feature spaces with the 
non-iterative algorithm, many details are rapidly lost. Our algorithm allows more stability between 
two consecutive scales. 
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6 = 2 6 = 4 6 = 6 6=8 

Figure 3: Evolution through scales of the parti tionings for our non iterative algorithm. 

9=1 




6=2 6=4 6=6 6=8 



Figure 4: Evolution through scales and feature spaces of partitionings with our algorithm. 

We show other segmentation results on the hand image (Figure|5]a)). For the first one 13 clusters 
are found against 37 for the second. The ring and the nails are not detected by the non iterative 
method because the selected color bandwidths are too large. The reason is that the regions to be 
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segmented are large, leading to large position bandwidths. Because the bandwidths for position and 
color are chosen jointly, the color bandwidth are also too large as a result. 




a) b) c) 

Figure 5: Validation of the iterative selection on the hand image, a) Original image; b) Non iterative 
bandwidth selection; c) Iterative bandwidth selection. 



A last result is presented on the bull image (Figure|6]a)). Major differences between the segmenta- 
tions obtained with the two methods are visible on the bull itself. In particular, the iterative algorithm 
keeps more details on the head of the animal. The number of clusters found by the two algorithms are 
not so different though. Indeed, the iterative method found 136 clusters while the non-iterative one 
gave 130 clusters. While more important details are kept on the bull by the iterative method, some 
little (but less important) clusters are lost on the grass. 




a) b) c) 

Figure 6: Validation of the iterative selection on the bull image, a) Original image; b) Non iterative 
bandwidth selection; c) Iterative bandwidth selection. 



A surprising result concerns the computational cost. One could think that the non iterative method 
would be much faster than our iterative algorithm. This is not the case, even sometimes the iterative 
selection is faster. This can be explained as follows. While the first iterations are run for large 
bandwidths (mean over all the predefined bandwidths), in subsequent iterations, the best bandwidths 
have been chosen for the first feature spaces. These bandwidths are more adapted and lead to faster 
computation of the mean shift partitionings. 

To conclude, the iterative method permits to keep more details than the non iterative one, even if 
the results are, sometimes, visually close. With the iterative method, it is not the same scale that is 
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chosen for all the dimensions. Furthermore, the introduction of our iterative selection method does 
not cause any computation overhead. 

5.2 Validation of the pseudo balloon mean shift partitioning 

A novelty of our approach is the introduction of the pseudo balloon mean shift partitioning. We 
compare this partitioning method to the variable mean shift partitioning introduced by Comaniciu 
in 0. In 11231 it has been shown that the balloon estimator gives good results when the number of 
dimensions increases, which led to its use in this paper. The comparison is done as follows. The 
bandwidth selection using the pseudo balloon mean shift partitioning as described in this paper is 
first run. Then using the selected bandwidths, the variable bandwidth and the pseudo balloon mean 
shift partitioning are run and give the final segmentations. 

Here again we start by showing the results on the outdoor image. Figure [7] shows the final parti- 
tioning for the variable mean shift based on the sample point estimator (b)) and the pseudo balloon 
mean shift (c)). Nearly the same results were obtained by the two partitioning method. The sample 
point estimator gave 29 clusters against 31 for the pseudo balloon mean shift partitioning. Few tiny 
differences can be found in the clouds. 




a) b) c) 

Figure 7: Validation of "pseudo balloon mean shift" on the outdoor image, a) Original image; b) 
Variable mean shift partitioning; c) Pseudo balloon mean shift partitioning. 



The next result is on the hand image (Figure [8). The segmentations are again very close with 
the two estimators. The variable mean shift partitioning j7j| gave 35 clusters and the pseudo balloon 
37. The segmentation of the forefinger for the variable sample point mean shift is slightly less clean 
(composed of two clusters). 

We end this subsection by presenting the segmentation results on the bull image (Figure[9]). Con- 
trary to the two previous results, many differences are visible between the final partitioning obtained 
with the variable mean shift based on the sample point estimator (Figure[9tb)), which gave 128 clus- 
ters, and the pseudo balloon mean shift (c), which led to 136 clusters. The pseudo balloon mean shift 
permits to keep more details on the head of the bull. Also, the clusters found on the back are less 
messy than the ones found with the algorithm using the sample point estimator. 



RR n° 6286 



22 



Bugeau & Perez 




a) b) c) 

Figure 8: Validation of the "pseudo balloon mean shift" on the hand image, a) Original image; b) 
Variable mean shift partitioning; c) Pseudo balloon mean shift partitioning. 




a) b) c) 

Figure 9: Validation of the "pseudo balloon mean shift" on the bull image, a) Original image; b) 
Variable mean shift partitioning; c) Pseudo balloon mean shift partitioning. 

To conclude, the results presented in this subsection show that the pseudo balloon mean shift 
partitioning can be as good (or even better) as the variable mean shift of Comaniciu |7). In addition, 
the balloon estimator is more adapted to high-dimensional (d > 3) heterogeneous data {e.g. five- 
dimensional data in color segmentation), thanks to the iterative bandwidth selection we introduced in 

El 

5.3 Ordering the feature spaces 

One could wonder if the order in which the feature spaces are studied is important. In fact it has only 
a small influence (Figures [TOl and fTTTi. These results have been obtained using 9 bandwidths in the 
range 3-20. As judging a segmentation depends on the subsequent application, defining the order in 
which the feature spaces should be processed is at this stage not really possible. An intuition would 
be to start with the position before processing successively the feature spaces having the highest noise 
or the highest contrast in the image. 

6 Conclusion 

Automatic bandwidth selection for kernel bandwidth estimation has become an important research 
area as the popularity of mean shift methods for image and video segmentation increases. Several 



INRIA 



Bandwidth selection for kernel estimation in mixed multi-dimensional spaces 



23 




a) 



b) 



Figure 10: Ordering the feature spaces. Results of the balloon mean shift partitioning on the outdoor 
image when the feature spaces are ordered in the following ways: a) x-coordinate, y-coordinate, red 
channel, green channel, blue channel; b) blue channel, green channel, red channel, y-coordinate, 
x-coordinate. 



II 1 II 



a) 



b) 



Figure 11: Ordering the feature spaces. Results of the balloon mean shift partitioning on the baboon 
image when the feature spaces are ordered in the following ways: a) x-coordinate, y-coordinate, red 
channel, green channel, blue channel; b) blue channel, green channel, red channel, y-coordinate, 
x-coordinate. 



methods already exist in literature but none of them is really adapted to the case of multidimensional 
heterogeneous features. This is the problem we addressed in this paper. To this end, we first intro- 
duced the pseudo balloon mean shift filtering and partitioning to which the kernel bandwidth selection 
was applied. The convergence of this filtering method has been proved. Following |@J, the selection is 
is based on the intuition that a good partition must be stable through scales. The bandwidth selection 
method is based on an iterative selection over the different feature subspaces. It allows a richer search 
of optimal analysis bandwidths than the non iterative method in H. The validity of our algorithm was 
shown on color image segmentation. Note that our algorithm has also been used for motion detection 
in 0, leading to very promising results. A direction of future research concerns the computation of 
covariance matrices. It would indeed be valuable to devise a new way of computing them that permits 
to capture the distribution near cluster's mode and the tails of a cluster. 
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Appendix 

Proof of convergence of the pseudo balloon mean shift filtering 

The balloon kernel density estimator is defined as: 

n 

/w = ^Ep7^p fc (ii H ( x )" 1/2 ( x - xW )ii 2 ) • < a -d 

The proof of convergence of mean shift filtering using this estimator is closed to the one of the fixed 
bandwidth mean shift filtering JSJ. We first show that / is convergent for the trajectory points defined 
in algorithm[T] i.e. that /(y (j) ) converges when j becomes large if m(x) is defined as in d22l . Since n 
is finite, / is bounded : < f(x) < ^^p^ . It is then sufficient to show that / is strictly increasing 
or decreasing. Since the bandwidth H(x) is constant for all trajectory points y (i) associated to the 
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estimation point x, we get: 

/(y°' +1) )-/(y (j) ) 

n 

= n |H(x)|V2 E (*(HH(x)-V»(yW+i) - x«)|| 2 ) - k(\\H( X )-^(y^ - x«)|| 2 ) 

(A.2) 

The convexity of the profile fc implies that: 

V^ijXg) S [0,+oo) k(x 2 ) > k(xi) + k'(xi)(x 2 - xi) , 

and thus: 

/(y (i+1) )-/(y w ) > n | H( x)|V 2 E^dlHCx)- 1 ^^ -xW)f) 

(||H(x)- 1 / 2 (yW+ 1 ) -xW)|| 2 - ||H(x)-Va(y«) -xW)|| 2 ). 

(A.3) 

We assume that H(x) T = H(x). Developing the last term and using the definition of the mean shift 
vector (equation[22l implies after some manipulations: 

/(y (j ' +1) )-/(y (j) ) 

n 

- n[H(x)|V2 E fc/ (H H ( x )' 1/2 (y (j) -x^)|| 2 )(||H(x)-^ (y (W) _ y 0-) ) |, 2 



(A.4) 

Summing terms of previous equation for index j,j + 1, . . . ,j + m — 1, and introducing 

M = argmin ;>0 fc(||H(x)- 1/2 (y (i) -x w )|| 2 results in: 

f(y u+m) ) - /(y (j) ) > ITT , Cfc , |1/2 M|[H(x)- 1 /2( y (i+") _ y (i))||2 > o. (A5) 

We have shown that the sequence {/(y^)}j=i,2... is strictly increasing, bounded, and thus conver- 
gent. It is also a Cauchy sequence. Inequality JA.Sb implies that {y^} 3 - = i,2... is also a Cauchy 
sequence with respect to Mahalanobis norm, hence with respect to Euclidean norm (by vertue of 
norm equivalence in R d ). This proves the convergence of trajectory points towards a local mode of 
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